-
-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to instantiate named clusters from jobqueue config #604
Conversation
Thanks @alisterburt for this proposal. I went through the discussion in dask-contrib/dask-ctl#61, and I'm not sure of the outcome. @jacobtomlinson do you think the change here are necessary or should we wait for a PR in |
Thank you for taking the time to go through the PR/discussion @guillaumeeb! I think the correct answer here depends on how you think dask-jobqueue users should ideally configure their clusters moving forwards. Is that through a jobqueue config or a dask-ctl config? There is a valid argument against having both, but...
I see the two PRs (this and the future one in dask-ctl) as orthogonal and complementary unless there is an effort to move to using dask-ctl as the only way to create/manage clusters in the dask ecosystem |
fixed the silly test failure but workflow requires re-approval from a maintainer 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is all it takes in code modifications, clearly I'm in favor of this!
I'm almost surprised the CLUSTER_CLASSES[job_sheduling_system](**config)
actually does the trick! We have consistent naming everywhere in kwargs and yaml entries?
I've left a few comments, and also it would be good to have some documentation about this functionality, could you add some?
Also, I'd like to have @jacobtomlinson approval before merging, at least on the idea!
|
||
Named clusters should specify the appropriate job scheduling system under the | ||
'job-scheduling-system' key in the jobqueue config alongside specific keyword | ||
arguments passed to the JobQueue cluster class on initialisation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arguments passed to the JobQueue cluster class on initialisation. | |
arguments passed to the JobQueueCluster class on initialisation. |
|
||
def test_jobqueuecluster_from_name_for_existing_scheduling_systems(): | ||
for scheduling_system in CLUSTER_CLASSES.keys(): | ||
temporary_config = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is setting this temporary config really useful to this test?
|
||
|
||
def test_jobqueuecluster_from_name_for_custom_cluster(): | ||
temporary_config = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here also you don't seem to check the custom config. Wasn't that mean to check that htcondor default config wasn't taken into account?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR I see this PR as stepping too far in the direction of implementing cluster lifecycle management and would prefer to see
the broader community collaborating on dask-ctl
rather than building duplicate solutions independently.
I disagree about this being orthogonal to the dask-ctl
discussion. From a project scope and overall Dask maintenance perspective this feels like it shouldn't live in dask-jobqueue
.
The problem this PR solves is "How can admins preconfigure user environments with some cluster templates?". The discussion in dask-ctl
solves the same problem but for everyone.
Note
Dask Gateway also already has a solution for this for HPC. I'm keen to upstream things from gateway into ctl too because gateway has limited scope and use cases.
Note
Dask Labextension also has similar functionality (which inspired thedask-ctl
code) for defining a template for the "New cluster" button. The goal is for this to leveragedask-ctl
in the future instead.
My goal is to try and push as much cluster lifecycle logic as possible into dask-ctl
so that it is all in one place and is easily reusable between subprojects. I dislike the idea of individual deployment projects implementing their own solutions to this problem, and other problems like it, rather than collaborating on a shared tool like dask-ctl
.
I see the two PRs (this and the future one in dask-ctl) as orthogonal and complementary unless there is an effort to move to using dask-ctl as the only way to create/manage clusters in the dask ecosystem
The goal of dask-ctl
is to be a control plane for all Dask clusters. It wont be the only way to do things but it will be the most consistent and convenient way to do cluster lifecycle things.
I doubt dask-jobqueue will ever disallow config through a jobqueue cluster in favour of dask-ctl completely
I agree and I expect in the future that users can still use the individual libraries directly to "pull the cables and move the flaps" if they want to.
I also think cluster creation will always be simplest by using the library directly. The value in dask-ctl
is value add things like discovery, listing, aggregating, templating and single-pane-of-glass views.
dask-jobqueue only config minimises the surface area of dask API users have to understand
this PR will enable discovery of clusters specified in a jobqueue config within dask-ctl
I would push back on this. This PR expands the Dask API here instead of guiding folks to use an existing API. From a user perspective there is still API to learn.
We are on a journey with dask-ctl
where hopefully one day it will be upstreamed to distributed
or made directly available via the dask
namespace like dask.distributed
is today. But to get there we need folks to use it rather than working independently.
My question to @alisterburt is what can we do to make dask-ctl
better and reduce friction there? For example I totally sympathise with your point about the use of the keywords args
and kwargs
in the cluster spec. Maybe we want to change that to something easier to understand in a v2
spec? (this is why I versioned the spec, so we can iterate nicely)
@guillaumeeb @jacobtomlinson thank you both for the reviews! @guillaumeeb as there currently is no consensus around whether this PR is wanted I will hold off on updating for now. @jacobtomlinson I am completely on board with the centralisation of cluster lifecycle management in dask-ctl, one of the main reasons for this PR was to simplify the implementation of dask-ctl discovery in dask-jobqueue in a followup PR 🙂
I think the main problem this PR solves is actually "how can users preconfigure multiple clusters of the same type without leaving jobqueue?". This is currently not possible using the jobqueue config and adding dask-ctl into the mix is a good (i.e. not ad-hoc) solution. I think not being able to do this is a deficiency of the current jobqueue config that should be solved in jobqueue.
This is exciting! It feels like clearly delineating what dask-ctl and dask-foo should be responsible for (in the ideal case) moving forwards would be useful here. I'm still a bit unsure
For my own personal use dask-ctl is working wonderfully, I'm happy creating clusters from yaml paths for now and will be happier creating from names once we get dask-contrib/dask-ctl#61 going. As a developer making things for non-python folks most likely working in HPC, I really like the idea of developing against dask-ctl and creating/discovering dask-foo clusters from there. This is how I imagine the top of programs I will provide to people from dask_ctl import get_cluster, create_cluster
# cluster_name comes in from the outside world
try:
cluster = get_cluster(cluster_name)
except NameError:
cluster = create_cluster(cluster_name) The friction for my use case comes in when asking users to configure a cluster and they need jobqueue and ctl rather than just jobqueue - I'm optimising to keep the set of things users interact with as minimal as possible Overall I am happy to go either way in this repo and will PR to dask-ctl soon regardless but I do think there is merit to this including this PR in jobqueue 🙂 |
I think the core of my comments are that I think creating cluster templates is out of scope for To quote the Zen of Python
I haven't reviewed this PR line by line yet as I wanted to have the high level discussion first. However at a quick glance I don't think this satisfies the In this PR
That looks like a nice example. In from dask_ctl import create_cluster
cluster = create_cluster("/path/to/spec.yaml") # If 'foo' exists it connects to it, if not it creates it # /path/to/spec.yaml
version: 1
module: "dask_kubernetes.operator"
class: "KubeCluster"
kwargs:
name: "foo" Most people only want a client anyway so a one liner would be client = create_cluster("/path/to/spec.yaml").get_client()
# They can still access the cluster object at 'client.cluster' if they want to call 'client.cluster.scale(10)' or something Here's some examples of how you can use from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="foo") # If 'foo' exists it connects to it, if not it creates it (most users want this)
# Users who don't want this can control creation/connection
cluster = KubeCluster.from_name("foo") # If 'foo' exists it connects to it, if not it raises an exception (this meets dask-ctl expectations)
cluster = KubeCluster(name="foo", create_mode="CONNECT_ONLY") # Equivalent to 'from_name'
cluster = KubeCluster(name="foo", create_mode="CREATE_ONLY") # If 'foo' exists it raises an exception, if not it creates it
I totally understand, ultimately these are just all bits of Dask but we avoid installing them all by default to reduce bloat in the core package. But I understand that separate namespaces can add friction. Do you see this problem being more of an install-time issue or import-time issue? Would making |
Cool! Thank you for being patient, this is clear guidance I can get behind 🙂 let's close this PR and push for realising the ideal ecosystem then.
Your reading is correct and I think my understanding of dask here is lacking, if I had multiple users in a HPC environment what are the advantages/disadvantages of having them connect to the same Client/Cluster vs creating their own? Naively I thought we would want users to run their own for resiliency (other users can't crash my thing) and to avoid overwhelming the task graph if there are multiple heavy users. Thanks for the dask-kubernetes example - that's clean and flexible!
Sure - I think ~all friction would be removed if dask-ctl were bundled with jobqueue/dask itself at install time and the jobqueue documentation used dask-ctl for cluster templates rather than its own config. @guillaumeeb would you/the other maintainers here accept PRs in this direction? |
It's less about multiple users sharing a single cluster and more about multiple processes sharing a cluster (often sequentially, not at the same time). Workflow managers like Airflow have multiple stages in a pipeline which are each launched as a separate process or allocation with various levels of fan-in and fan-out. It's nice to have the first stage in your pipeline create a Dask cluster and each stage after that reuse the Dask resource and the final stage delete the Dask cluster. It's also very helpful in some failure modes that are not common on HPC. HPC allocations have a wall time, so if you use With the new
I would love to see this. What do you think @guillaumeeb? |
You're a star @jacobtomlinson - I learned a bunch and the motivation behind centralising this management is now crystal clear, thanks for writing this up! |
Well, wow, that's a pretty complete and detailed discussion here. I have to admit I didn't spend a lot of time looking at
|
I'm not suggesting removing any of the existing configuration in My comment was more that the config here shouldn't be expanded to include lifecycle things like cluster templates. Instead we should work together to do this in |
Ok, perfectly fine for me! |
Discussed in #543
This PR allows custom clusters of any type to be specified in the jobqueue yaml
Below is an example config file with a second PBS cluster which uses a different queue called 'gpu'
This can be instantiated with
JobQueueCluster.from_name()
This also works directly with the standard job scheduling systems
The main reasons for this PR are:
#543 was closed because the features provided here can be made available through a dask-ctl specific config yaml. I think this PR is still necessary because implementing autodiscovery of dask-jobqueue clusters in dask-ctl requires some way of 'discovering' possible clusters. With this PR, we can simply iterate over the keys in the jobqueue config and instantiate the specified clusters using
JobQueueCluster.from_name()
.Additionally, the config in dask-jobqueue is less Python specific than the dask-ctl yaml spec - this may be less of a burden for configuration by HPC admins who are not experts in Python.
#95 is similar functionality but implemented at the cluster class level so does not solve point 1 above.
cc @guillaumeeb because you were involved in earlier discussions
Thank you in advance for any time spent reviewing!